Handling Unknown Words in Arabic FST Morphology
Abstract
A morphological analyser recognizes only the words already present in its lexical database. It therefore needs a way of detecting significant changes in the language, in the form of newly borrowed or coined words that occur with high frequency. We develop a finite-state morphological guesser within a pipelined methodology for extracting unknown words, lemmatizing them, and assigning them a priority weight for inclusion in a lexicon. The processing is performed on a large contemporary corpus of 1,089,111,204 words that is passed through a machine-learning-based annotation tool. Our method is tested on a manually annotated gold standard of 1,310 forms and yields good results despite the complexity of the task. Our work shows the usability of a highly non-deterministic finite-state guesser in a practical and complex application.
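The three pipeline stages named in the abstract (extract unknown words, lemmatize them, assign a priority weight) can be illustrated with a minimal sketch. This is not the authors' finite-state guesser: it is a plain-Python stand-in, and the transliterated lexicon, the affix lists, the function names, the shortest-candidate lemma heuristic, and the use of raw frequency as the priority weight are all hypothetical assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy data (transliterated); not taken from the paper.
KNOWN_LEMMAS = {"kitab", "qalam"}
PREFIXES = ("wa", "al", "bi")
SUFFIXES = ("at", "un", "an")
MIN_STEM = 3  # assumed minimum stem length


def guess_lemmas(form):
    """Return all candidate lemmas: the form itself plus variants with
    at most one prefix and at most one suffix stripped."""
    stems = {form}
    for prefix in PREFIXES:
        if form.startswith(prefix) and len(form) - len(prefix) >= MIN_STEM:
            stems.add(form[len(prefix):])
    candidates = set(stems)
    for stem in stems:
        for suffix in SUFFIXES:
            if stem.endswith(suffix) and len(stem) - len(suffix) >= MIN_STEM:
                candidates.add(stem[:-len(suffix)])
    return candidates


def rank_unknown(tokens):
    """Group unknown forms under a guessed lemma and weight each lemma
    by the total corpus frequency of its forms."""
    weights = Counter()
    for form, freq in Counter(tokens).items():
        candidates = guess_lemmas(form)
        if candidates & KNOWN_LEMMAS:
            continue  # analysable with the existing lexicon
        # Crude disambiguation: keep the shortest candidate (ties broken alphabetically).
        lemma = min(candidates, key=lambda c: (len(c), c))
        weights[lemma] += freq
    return weights.most_common()


tokens = ["alkitab", "faysbuk", "alfaysbuk", "wafaysbuk", "altwitar", "qalam"]
ranking = rank_unknown(tokens)
print(ranking)
```

A real guesser would return many competing analyses per form, which is why the abstract calls it highly non-deterministic; the shortest-candidate rule here only stands in for whatever disambiguation the actual pipeline applies.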
Similar resources
Generating an Arabic Full-form Lexicon for Bidirectional Morphology Lookup
We describe the generation of an Arabic full-form lexicon and its conversion into a two-level Finite State Transducer (FST) for morphology analysis and generation. The implementation of morphological lookup is based on a representation of the relevant data in the form of an FST, for which generic implementations exist that facilitate the integration into larger software systems for natural langu...
Probabilistic Arabic Part of Speech Tagger with Unknown Words Handling
A part-of-speech (POS) tagger is an essential preprocessing step in many natural language applications. In this paper, we investigate the best configuration of a trigram Hidden Markov Model (HMM) Arabic POS tagger when only a small tagged corpus is available. With little training data, guessing the POS of unknown words is the main problem. This problem becomes more serious in languages which have huge size of voca...
Using foma for language-based games
This paper describes two examples of how finite-state technology (FST), commonly used in computational morphology, can help implement language-based games. The tool we have used is foma, an open-source toolkit similar to the earlier Xerox/PARC finite-state tools. FST tools have been widely used to describe the morphology of languages and to implement spelling checkers and correctors, especially for ...
Handling Unknown Words in Statistical Latent-Variable Parsing Models for Arabic, English and French
This paper presents a study of the impact of using simple and complex morphological clues to improve the classification of rare and unknown words for parsing. We compare this approach to a language-independent technique often used in parsers which is based solely on word frequencies. This study is applied to three languages that exhibit different levels of morphological expressiveness: Arabic, ...
HornMorpho: a system for morphological processing of Amharic, Oromo, and Tigrinya
Despite its linguistic complexity, the Horn of Africa region includes several major languages with more than 5 million speakers, some crossing the borders of multiple countries. All of these languages have official status in regions or nations and are crucial for development; yet computational resources for the languages remain limited or non-existent. Since these languages are complex morpholo...
Publication date: 2012